Skip to content

[pull] master from DataDog:master#533

Merged
pull[bot] merged 5 commits into
ConnectionMaster:masterfrom
DataDog:master
May 12, 2026
Merged

[pull] master from DataDog:master#533
pull[bot] merged 5 commits into
ConnectionMaster:masterfrom
DataDog:master

Conversation

@pull

@pull pull Bot commented May 12, 2026

Copy link
Copy Markdown

See Commits and Changes for more details.


Created by pull[bot] (v2.0.0-alpha.4)

Can you help keep this open source service alive? 💖 Please sponsor : )

HadhemiDD and others added 5 commits May 12, 2026 14:34
* fix teamcity tests failure

* pin mockserver version to 5.15
* Add GPU cost overview dashboards

Adds 5 dashboards to the GPU integration for monitoring GPU compute
spend and utilization across cloud providers and Kubernetes.

Dashboards:
- gpu_cost_overview: cross-cloud totals, spend by team/env, fleet utilization
- aws_gpu_cost_overview: AWS-specific with Capacity Block tracking
- azure_gpu_cost_overview: Azure GPU VM families (NC, NCv3, ND series)
- gcp_gpu_cost_overview: GCP GPU SKUs with On-Demand vs Committed coverage
- k8s_gpu_cost_overview: cluster/namespace allocation, idle cost attribution

Cost queries span cloud_cost amortized metrics plus unblended for
AWS Capacity Blocks (which are not captured in amortized). Utilization
widgets join GPU telemetry (gpu.sm_active, gpu.device.total) with
cost data for unit-economics views.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Refine GPU utilization metrics and dashboard polish

- Switch unhealthy KPIs to GPU Idle % (gr_engine_active) since
  gpu.device.unhealthy is non-functional
- Add Healthy GPU Rate KPI on k8s using kubernetes_state.node.gpu_allocatable
  / gpu_capacity
- Add Spend on Idle GPUs KPI to per-cloud and overview dashboards
  (excludes AWS Capacity Blocks since their cost is upfront and unrelated
  to engine activity)
- Standardize utilization terminology: "Average GPU Utilization %" and
  "GPU Idle %" across dashboards
- Drop redundant cloud provider prefix from per-cloud widget titles and
  remove the "GPU Spend" group wrapper to match k8s dashboard layout

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Add GPU Monitoring setup CTA banner to per-cloud and k8s dashboards

Mirrors the existing banner from the overview dashboard so users on any
cloud-specific or Kubernetes view see the link to enable GPU Monitoring,
which populates utilization-driven widgets.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…23609)

* fix(nutanix): normalize all ntnx_* tags to lowercase with $unknown fallback

Route every enum-backed ntnx_* tag (host_type, hypervisor_type,
node_status, plus the previously-handled state tags) through a single
_norm_state helper so they all follow one rule: lowercase the API
value, fall back to "\$unknown" when the source is missing.

Picks "\$unknown" (the API spec's own sentinel) as the fallback so
there's no mismatch between "value present but says \$UNKNOWN" and
"value missing" — both surface as ntnx_X:\$unknown.

ntnx_disk_status's "unknown" fallback is updated to "\$unknown" for
the same reason.

* docs(nutanix): add changelog for tag normalization

* docs(nutanix): shorten changelog to one customer-facing line

* refactor(nutanix): extract tag values into named variables

Hoist _norm_state and get_nested calls out of f-strings in the
tag-extraction helpers. Each tag computation now binds to a named
local first, making the read top-down and easier to step through.

* refactor(nutanix): rename _norm_state to _normalize_tag_value

The helper is used for type, state, mode, and status tags — not just
state — so the broader name better describes what it does.

* refactor(nutanix): collapse node_status to one variable, restore tags = []

Lowercase the node-status comparison sets so the normalized tag value
serves both the status_value lookup and the tag emission, removing the
need for a separate node_status_tag local.

Restore the tags = [] preamble in the tag-extraction helpers since it
makes the building intent obvious.

* fix(nutanix): normalize powerState in vm.status gauge

Match what _report_host_status_metrics does: route the powerState
lookup through _normalize_tag_value and lowercase the comparison
literals. Removes the asymmetry where vm.status was the only metric
still relying on raw uppercase API values.

Addresses review feedback on PR #23609.

* refactor(nutanix): normalize disk statuses in _aggregate_disk_status

Lowercase DEGRADED_DISK_STATUSES at the constant and route disk
``status`` values through ``_normalize_tag_value`` so the comparison
surface matches the rest of the module. Aligns with the convention
introduced in ``_report_host_status_metrics``.

* test(nutanix): cover \$unknown fallback for host enum tags

Verify ``ntnx_host_type``, ``ntnx_hypervisor_type``, and
``ntnx_node_status`` emit ``\$unknown`` when ``hostType``,
``hypervisor.type``, and ``nodeStatus`` are absent from the host
payload, the always-emit behavior introduced in this PR.

* test(nutanix): align unknown-fallback test with repo conventions
delete 'pause_auto_refresh'
@pull pull Bot locked and limited conversation to collaborators May 12, 2026
@pull pull Bot added the ⤵️ pull label May 12, 2026
@pull pull Bot merged commit 6ef0218 into ConnectionMaster:master May 12, 2026
1 check passed
@pull pull Bot temporarily deployed to release May 12, 2026 23:37 Inactive
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants